Backpropagation Through Time (part c)

Last step! Adjusting W_x, the weight matrix connecting the input to the state.

If you took on the previous challenge of deriving the math by yourself first, sit back, fasten your seat belts and compare our notes to yours! Don't worry if you made mistakes; we all do. Your mistakes will help you learn what to avoid next time.

[Video: RNN BPTT, part c]

Gradient calculations needed to adjust W_x

To further understand the BPTT process, we will simplify the unfolded model again. This time the focus will be on the contributions of W_x to the output, as shown in the following figure:

*Simplified unfolded model for adjusting W_x*

When calculating the partial derivative of the loss function with respect to W_x we need to consider, again, all of the states contributing to the output. As we saw before, in this example these are state \bar{s_3}, which depends on its predecessor \bar{s_2}, which in turn depends on its predecessor \bar{s_1}, the first state.

As we mentioned previously, in BPTT we will take into account each gradient stemming from each state, accumulating all of the contributions.

At timestep t=3, the contribution to the gradient stemming from \bar{s_3} is the following:
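Written out with the chain rule, this term can be sketched as equation 1. The symbol \bar{y_3} for the output at t=3 is an assumption, since the output is not named in this section, and \frac{\partial \bar{s_3}}{\partial W_x} is read as the direct dependence of \bar{s_3} on W_x, with \bar{s_2} held fixed:

```latex
\frac{\partial E_3}{\partial \bar{y_3}} \cdot
\frac{\partial \bar{y_3}}{\partial \bar{s_3}} \cdot
\frac{\partial \bar{s_3}}{\partial W_x}
\tag{1}
```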
(Notice the use of the chain rule here. If you need, go back to the video to visualize the calculation path).

At timestep t=3, the contribution to the gradient stemming from \bar{s_2} is the following:
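Following the path from the error back through \bar{s_3} to \bar{s_2}, this term can be sketched as equation 2. As an assumption, \bar{y_3} denotes the output at t=3, and \frac{\partial \bar{s_2}}{\partial W_x} is the direct dependence of \bar{s_2} on W_x:

```latex
\frac{\partial E_3}{\partial \bar{y_3}} \cdot
\frac{\partial \bar{y_3}}{\partial \bar{s_3}} \cdot
\frac{\partial \bar{s_3}}{\partial \bar{s_2}} \cdot
\frac{\partial \bar{s_2}}{\partial W_x}
\tag{2}
```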
(Notice how the equation, derived by the chain rule, considers the contribution of \bar{s_2} to \bar{s_3}. If you need, go back to the video to visualize the calculation path).

At timestep t=3, the contribution to the gradient stemming from \bar{s_1} is the following:
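Extending the path one more step back, from \bar{s_3} through \bar{s_2} to \bar{s_1}, this term can be sketched as equation 3. Again, \bar{y_3} for the output at t=3 is an assumed symbol, and \frac{\partial \bar{s_1}}{\partial W_x} is the direct dependence of \bar{s_1} on W_x:

```latex
\frac{\partial E_3}{\partial \bar{y_3}} \cdot
\frac{\partial \bar{y_3}}{\partial \bar{s_3}} \cdot
\frac{\partial \bar{s_3}}{\partial \bar{s_2}} \cdot
\frac{\partial \bar{s_2}}{\partial \bar{s_1}} \cdot
\frac{\partial \bar{s_1}}{\partial W_x}
\tag{3}
```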
(Notice how the equation, derived by the chain rule, considers the contribution of \bar{s_1} to \bar{s_2} and \bar{s_3}. If you need, go back to the video to visualize the calculation path).

After considering the contributions from all three states, \bar{s_3}, \bar{s_2} and \bar{s_1}, we will accumulate them to find the final gradient calculation.

The following equation is the gradient contributing to the adjustment of W_x using Backpropagation Through Time:
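Summing the three chain-rule paths, one per state, the accumulated gradient can be sketched as below. The symbol \bar{y_3} for the output at t=3 is an assumption, and each \frac{\partial \bar{s_i}}{\partial W_x} is the direct dependence of \bar{s_i} on W_x, with the previous state held fixed:

```latex
\frac{\partial E_3}{\partial W_x} =
  \frac{\partial E_3}{\partial \bar{y_3}} \cdot
  \frac{\partial \bar{y_3}}{\partial \bar{s_3}} \cdot
  \frac{\partial \bar{s_3}}{\partial W_x}
+ \frac{\partial E_3}{\partial \bar{y_3}} \cdot
  \frac{\partial \bar{y_3}}{\partial \bar{s_3}} \cdot
  \frac{\partial \bar{s_3}}{\partial \bar{s_2}} \cdot
  \frac{\partial \bar{s_2}}{\partial W_x}
+ \frac{\partial E_3}{\partial \bar{y_3}} \cdot
  \frac{\partial \bar{y_3}}{\partial \bar{s_3}} \cdot
  \frac{\partial \bar{s_3}}{\partial \bar{s_2}} \cdot
  \frac{\partial \bar{s_2}}{\partial \bar{s_1}} \cdot
  \frac{\partial \bar{s_1}}{\partial W_x}
```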

As mentioned before, in this example we had 3 time steps to consider, therefore we accumulated three partial derivative calculations. Generally speaking, we can consider multiple timesteps back. If you look closely at equations 1, 2 and 3, you will notice a pattern again: each time we propagate one step further back, there is one additional partial derivative to consider in the chain rule. Mathematically, this can be written as the following general equation for adjusting W_x using BPTT:
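For an error E_N at timestep N, one way to sketch the general form is shown below. The symbol \bar{y_N} for the output at timestep N is an assumption; \frac{\partial \bar{y_N}}{\partial \bar{s_i}} stands for the whole chain from the output back to state \bar{s_i}, and \frac{\partial \bar{s_i}}{\partial W_x} is the direct dependence of \bar{s_i} on W_x:

```latex
\frac{\partial E_N}{\partial W_x} =
  \sum_{i=1}^{N}
  \frac{\partial E_N}{\partial \bar{y_N}} \cdot
  \frac{\partial \bar{y_N}}{\partial \bar{s_i}} \cdot
  \frac{\partial \bar{s_i}}{\partial W_x},
\qquad
\frac{\partial \bar{y_N}}{\partial \bar{s_i}} =
  \frac{\partial \bar{y_N}}{\partial \bar{s_N}}
  \prod_{k=i+1}^{N} \frac{\partial \bar{s_k}}{\partial \bar{s_{k-1}}}
```

For i = N the product is empty (equal to 1), which recovers the single-step term stemming from the last state.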

Notice the similarities between the calculations of \frac{\partial{E_3} }{\partial W_s} and \frac{\partial{E_3} }{\partial W_x}. Hopefully, after understanding the calculation process of \frac{\partial{E_3} }{\partial W_s}, understanding that of \frac{\partial{E_3} }{\partial W_x} was straightforward.
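To make the accumulation concrete, here is a minimal numerical sketch in Python. It is not the lesson's own code, and everything model-specific is an assumption made for illustration: scalar weights `w_x`, `w_s`, `w_y`, a tanh state activation, a linear output, a squared error E_3 = (d_3 - y_3)^2, and made-up input values. It computes one chain-rule term per state, sums them, and compares the result with a finite-difference estimate.

```python
import math

def forward(w_x, w_s, w_y, xs, s0=0.0):
    """Run the assumed scalar RNN; return all states and the final output."""
    states = []
    s = s0
    for x in xs:
        s = math.tanh(w_x * x + w_s * s)   # s_t = tanh(w_x * x_t + w_s * s_{t-1})
        states.append(s)
    return states, w_y * states[-1]        # output taken at the last timestep

def error(w_x, w_s, w_y, xs, d):
    """Assumed squared error at the last timestep."""
    _, y = forward(w_x, w_s, w_y, xs)
    return (d - y) ** 2

def bptt_grad_wx(w_x, w_s, w_y, xs, d):
    """Return the per-state contributions to dE/dw_x and their sum."""
    states, y = forward(w_x, w_s, w_y, xs)
    n = len(xs)
    dE_dy = -2.0 * (d - y)                 # derivative of (d - y)^2 w.r.t. y
    dy_ds_last = w_y                       # derivative of the output w.r.t. last state
    contributions = []
    for i in range(n):                     # states[i] is state s_{i+1}
        # Chain of state-to-state derivatives from the last state back to states[i].
        ds_last_ds_i = 1.0
        for k in range(i + 1, n):
            ds_last_ds_i *= (1.0 - states[k] ** 2) * w_s
        # Direct dependence of states[i] on w_x (previous state held fixed).
        ds_i_dwx = (1.0 - states[i] ** 2) * xs[i]
        contributions.append(dE_dy * dy_ds_last * ds_last_ds_i * ds_i_dwx)
    return contributions, sum(contributions)

# Hypothetical inputs, target and weights for a 3-step example.
xs = [0.5, -0.3, 0.8]
d = 0.2
w_x, w_s, w_y = 0.7, -0.4, 1.1

contribs, grad = bptt_grad_wx(w_x, w_s, w_y, xs, d)

# Central finite difference as an independent check of the accumulated gradient.
eps = 1e-6
numeric = (error(w_x + eps, w_s, w_y, xs, d)
           - error(w_x - eps, w_s, w_y, xs, d)) / (2 * eps)

print("contributions from s_1, s_2, s_3:", contribs)
print("accumulated BPTT gradient:", grad)
print("finite-difference estimate:", numeric)
```

The list `contribs` holds the terms in the order \bar{s_1}, \bar{s_2}, \bar{s_3}; the inner loop grows by one factor for each step further back, which mirrors the pattern described above.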